-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: ignore duplicated events in fingerprint validator #89332
Conversation
First commit is a slight variation of what I had done in #88961 to remove retries from the validator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 228 at r1 (raw file):
if m == nil { t.L().Printf("end of changefeed") return nil
This control escape bypasses close(cmvt.validatorDone)
. Methinks waitForValidator
would block indefinitely in this case.
To handle mixed-version CDC tests, the fingerprint validator had previously been updated to handle retries internally; this complicates the validator as the retry logic is only ever used in the mixed-version roachtests. This commit removes the retry logic from the validator and instead allows the caller to pass a `DBFunc` to be called whenever a database connection is needed after initialization. By passing a custom DBFunc, tests that need it (such as the mixed-versions roachtest) can pass a function that accounts for temporary unavailability of nodes. Specifically, we pass a function that blocks while nodes are being upgraded, to simplify reasoning of this test's behavior. In order to support that, we also change the signature of `NewFingerprintValidator` to return the actual concrete validator (which is now made public) instead of the `Validator` interface, following the generally-good approach of "accept interfaces, return structs" in Go. Release note: None
f4e74b8
to
27b214f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @samiskin, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 228 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This control escape bypasses
close(cmvt.validatorDone)
. MethinkswaitForValidator
would block indefinitely in this case.
validatorDone
is removed on the second commit. If that error condition did happen, the validator would hang waiting for resolved timestamps and the test would eventually timeout.
Changed the handling of this situation to call t.Fatal
to fail early and more clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)
pkg/ccl/changefeedccl/cdctest/validator.go
line 636 at r4 (raw file):
} v.previouslySeen[row.eventKey()] = struct{}{}
What about duplicates with the same row key? Are those allowed by CDC semantics?
pkg/ccl/changefeedccl/cdctest/validator.go
line 649 at r4 (raw file):
} // row's timestamp is after the last resolved timestam; no problem
Nit: spelling timestamp
.
pkg/ccl/changefeedccl/cdctest/validator.go
line 657 at r4 (raw file):
// hasn't been seen before; that shouldn't happen if _, seen := v.previouslySeen[row.eventKey()]; !seen { return fmt.Errorf("unexpected new event at timestamp %s after resolved timestamp %s",
"unexpected out-of-order event at timestamp..."?
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 54 at r4 (raw file):
// enforcing a maximum number of operations sets a boundary on the // validator's memory usage maxOps = 12000
Do you have a sense of the memory footprint per maxOps? I am mostly just curious why 12000
is a safe value.
This commit updates the fingerprint validator (and its use in the `cdc/mixed-versions` test) to ignore duplicated events received by the validator. A previously implicit assumption of the validator is that any events that it receives are either not duplicated, or -- if they are duplicated -- they are within the previous resolved timestamp and the current resolved timestamp. However, that assumption is not justified by the changefeed guarantees and depends on how frequently `resolved` events are emitted and how often the changefeed checkpoints. In the specific case of the `cdc/mixed-versions` roachtest, it was possible for the changefeed to start from an old checkpoint (older than the last received `resolved` timestamp), causing it to re-emit old events that happened way before the previously observed resolved event. As a consequence, when the validator applies the update associated with that event, there is a mismatch with state of the original table as of the update's timestamp, as the fingerprint validator relies on the fact that updates are applied in order. To fix the issue, we now skip events that happen before the timestamp of the previous `resolved` event received. In addition, the caller can also tell the validator to verify that such out-of-order messages received by the validator have indeed been previously seen; if not, that would represent a violation of the changefeed's guarantees. Fixes: cockroachdb#87251. Release note: None
27b214f
to
f350ab7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @samiskin, and @srosenberg)
pkg/ccl/changefeedccl/cdctest/validator.go
line 636 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
What about duplicates with the same row key? Are those allowed by CDC semantics?
In any run there will be multiple updates with the same row key, but with different timestamps. Which is why this event key is a combination of row key and timestamp.
pkg/ccl/changefeedccl/cdctest/validator.go
line 649 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit: spelling
timestamp
.
Fixed.
pkg/ccl/changefeedccl/cdctest/validator.go
line 657 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
"unexpected out-of-order event at timestamp..."?
Updated the error message.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 54 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Do you have a sense of the memory footprint per maxOps? I am mostly just curious why
12000
is a safe value.
12000
was picked for a combination of: 1) workload finishes within the existing 30-minute timeout for the test; and 2) provides enough activity for the validator to see resolved timestamps during the different parts of the upgrade process validated in this test. I updated the comment associated with this constant with this information.
In terms of memory usage: since the workload is performing 12k operations, and each operation should translate to two keys being updated (account being transferred from, and account being transferred to), we should have ~24k entries in the previouslySeen
map (I validated this in a few runs of the test). I also wrote this benchmark [1] to approximate memory usage of a map of the same size and type of data. TLDR: the map should peak at around 4MB of memory used, which is pretty acceptable IMO.
[1] https://gist.github.com/renatolabs/d6fbe68dd4c367bac1cffa692482c4ff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)
pkg/ccl/changefeedccl/cdctest/validator.go
line 636 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
In any run there will be multiple updates with the same row key, but with different timestamps. Which is why this event key is a combination of row key and timestamp.
Right, I meant the same (row key, timestamp) tuple seen multiple times; is this valid?
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 54 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
12000
was picked for a combination of: 1) workload finishes within the existing 30-minute timeout for the test; and 2) provides enough activity for the validator to see resolved timestamps during the different parts of the upgrade process validated in this test. I updated the comment associated with this constant with this information.In terms of memory usage: since the workload is performing 12k operations, and each operation should translate to two keys being updated (account being transferred from, and account being transferred to), we should have ~24k entries in the
previouslySeen
map (I validated this in a few runs of the test). I also wrote this benchmark [1] to approximate memory usage of a map of the same size and type of data. TLDR: the map should peak at around 4MB of memory used, which is pretty acceptable IMO.[1] https://gist.github.com/renatolabs/d6fbe68dd4c367bac1cffa692482c4ff
Thanks for the clarification and a microbenchmark to estimate the memory footprint!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @samiskin, and @srosenberg)
pkg/ccl/changefeedccl/cdctest/validator.go
line 636 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Right, I meant the same (row key, timestamp) tuple seen multiple times; is this valid?
Yes, it's valid, changefeed's delivery is at-least-once. An "old" event can be received by the client even after multiple later resolved timestamps are received, which is exactly what the validator couldn't handle before this PR.
bors r=srosenberg TFTR! |
Build succeeded: |
In cockroachdb#89332, we started optionally validating changefeed semantics in the `FingerprintValidator` by making sure that unseen updates are never observed if a `resolved` message is received. In order to keep bounds on the validator's memory usage, the `cdc/mixed-versions` test was setting a maximum number of operations at the workload level. However, a consequence of that change is that it creates the possibility for the workload to finish before the roachtest has received all the `resolved` events it expects (for instance, if draining takes longer than usual for some reason; a variety of other non-determinism is also at play). To deal with this possibility, we instead enforce a maximum number of previously seen events at the validator's level. For the `cdc/mixed-versions` test, we set a maximum of 100,000 previously seen events being stored (20MB memory footprint). This should be equivalent to roughly 50,000 bank transfers, and give enough time for the test to finish successfully. Epic: None. Release note: None.
89782: roachtest: limit FingerprintValidator's memory usage internally r=miretskiy,srosenberg a=renatolabs In #89332, we started optionally validating changefeed semantics in the `FingerprintValidator` by making sure that unseen updates are never observed if a `resolved` message is received. In order to keep bounds on the validator's memory usage, the `cdc/mixed-versions` test was setting a maximum number of operations at the workload level. However, a consequence of that change is that it creates the possibility for the workload to finish before the roachtest has received all the `resolved` events it expects (for instance, if draining takes longer than usual for some reason; a variety of other non-determinism is also at play). To deal with this possibility, we instead enforce a maximum number of previously seen events at the validator's level. For the `cdc/mixed-versions` test, we set a maximum of 100,000 previously seen events being stored (20MB memory footprint). This should be equivalent to roughly 50,000 bank transfers, and give enough time for the test to finish successfully. Release note: None Epic: None. 89847: sql: turn plan sampling back off by default r=j82w a=j82w The value was originally disabled by default in: #88343. The value was enabled by default: #89020 because of a bug that was fixed in #89418. This PR is disabling it by default again now that the bug has been fixed. Part Of #89847 Release note (sql change): Turn the default value of sql.metrics.statement_details.plan_collection.enabled to false. Co-authored-by: Renato Costa <[email protected]> Co-authored-by: j82w <[email protected]>
This commit updates the fingerprint validator (and its use in the
cdc/mixed-versions
test) to ignore duplicated events received by thevalidator.
A previously implicit assumption of the validator is that any events
that it receives are either not duplicated, or -- if they are
duplicated -- they are within the previous resolved timestamp and the
current resolved timestamp. However, that assumption is not justified
by the changefeed guarantees and depends on how frequently
resolved
events are emitted and how often the changefeed checkpoints.
In the specific case of the
cdc/mixed-versions
roachtest, it waspossible for the changefeed to start from an old checkpoint (older
than the last received
resolved
timestamp), causing it to re-emitold events that happened way before the previously observed resolved
event. As a consequence, when the validator applies the update
associated with that event, there is a mismatch with state of the
original table as of the update's timestamp, as the fingerprint
validator relies on the fact that updates are applied in order.
To fix the issue, we now skip events that happen before the timestamp
of the previous
resolved
event received. In addition, the caller canalso tell the validator to verify that such out-of-order messages
received by the validator have indeed been previously seen; if not,
that would represent a violation of the changefeed's guarantees.
Fixes: #87251.
Release note: None